In this workbook we'll start to analyze our data diachronically. To do that, we'll make use of a special object called a Corpus that provides some handy methods for organizing and indexing our Papers. We'll use the slice method to index our Papers temporally and across journals, and plot the distribution of features across those slice axes.
We'll use the Web of Science dataset that we used in our last workbook. Since this is a new workbook, we'll have to load it again.
In [ ]:
from tethne.readers import wos
datadirpath = '/Users/erickpeirson/Downloads/datasets/wos'
MyCorpus = wos.read(datadirpath)
Think of a Corpus as a container for your Papers. Your Papers are still here; you can access your Papers via the papers attribute.
In [ ]:
MyCorpus.papers
In [ ]:
from tethne.networks import authors
cg = authors.coauthors(MyCorpus)
print 'This graph has {0} nodes and {1} edges, just like before!'.format(len(cg.nodes()), len(cg.edges()))
Your Papers are also indexed. For WoS datasets, they are indexed by wosid (UT in the original field-tagged data file).
In [ ]:
MyCorpus.indexed_papers.keys()[:10] # The first 10 keys in the Paper index.
So if you know the wosid of a Paper, you can retrieve it from Corpus.indexed_papers:
In [ ]:
MyCorpus.indexed_papers['WOS:000305886800001']
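Conceptually, an index like this is just a mapping from identifier to Paper. Here is a minimal sketch that uses plain dictionaries as stand-ins for Paper objects. The `toy_papers` records, their field values, and the second identifier are invented for illustration, and this is not necessarily how Tethne builds its index internally.

```python
# Hypothetical stand-ins for Paper objects. All field values are invented;
# only the first wosid is borrowed from the lookup shown above.
toy_papers = [
    {'wosid': 'WOS:000305886800001', 'date': 2012},
    {'wosid': 'WOS:EXAMPLE-0002', 'date': 2011},  # made-up identifier
]

# Build an index keyed on wosid, so a lookup doesn't have to scan the list.
toy_index = dict((paper['wosid'], paper) for paper in toy_papers)

found = toy_index['WOS:000305886800001']
print(found['date'])
```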
Often we're interested in how networks evolve over time. In Tethne, you can access Papers in a time-variant fashion using the slice method. slice returns a generator that yields dates and subcorpora.
In [ ]:
[i for i in MyCorpus.slice('date') ]
The default behavior is to divide your Papers up into 1-year time-periods. You can visualize the distribution of Papers over time using the plot_distribution method.
In [9]:
MyCorpus.axes['date'].keys()
Out[9]:
In [7]:
fig = MyCorpus.plot_distribution('date')
You can change how slice divides up your corpus temporally using the method, window_size, step_size, and cumulative keyword arguments.
Here are some methods for slicing a Corpus, which you can specify using the method keyword argument.
| Method | Description |
|---|---|
| time_window | Slices data using a sliding time-window. Dataslices are indexed by the start of the time-window. |
| time_period | Slices data into time periods of equal length. Dataslices are indexed by the start of the time period. |
The main difference between the sliding time-window (time_window) and the time-period (time_period) slicing methods is whether the resulting periods can overlap. Time-period slicing divides data into sequential, non-overlapping time periods, whereas the subsets generated by time-window slicing can overlap.
Time-period slicing, with a window-size of 4 years:
Time-window slicing, with a window-size of 4 years and a step-size of 1 year:
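To make the difference concrete, here is a rough sketch of both strategies applied to a plain list of publication years. The helper functions `time_period_slices` and `time_window_slices` and the `toy_years` data are hypothetical, written only for this illustration. They are not Tethne's implementation, and the choice to drop windows that run past the last year is an assumption of the sketch.

```python
def time_period_slices(years, window_size=1):
    """Group years into sequential, non-overlapping periods.

    Each slice is keyed by the first year of its period.
    """
    start, end = min(years), max(years)
    slices = {}
    for period_start in range(start, end + 1, window_size):
        period_end = period_start + window_size
        slices[period_start] = [y for y in years if period_start <= y < period_end]
    return slices


def time_window_slices(years, window_size=1, step_size=1):
    """Group years using a sliding window, so slices may overlap.

    Each slice is keyed by the first year of its window. Windows that would
    extend past the last year are dropped (an assumption of this sketch).
    """
    start, end = min(years), max(years)
    slices = {}
    for window_start in range(start, end - window_size + 2, step_size):
        window_end = window_start + window_size
        slices[window_start] = [y for y in years if window_start <= y < window_end]
    return slices


toy_years = [2000, 2001, 2001, 2002, 2003, 2004, 2005, 2006, 2007]

periods = time_period_slices(toy_years, window_size=4)
windows = time_window_slices(toy_years, window_size=4, step_size=1)

print(sorted(periods.keys()))  # [2000, 2004]
print(sorted(windows.keys()))  # [2000, 2001, 2002, 2003, 2004]
```

In the time-period result every year lands in exactly one slice, while in the time-window result most years are counted in several slices.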
Here's what plot_distribution looks like using a sliding time-window of 4 years:
In [10]:
MyCorpus.slice('date', method='time_window', window_size=4)
In [11]:
fig = MyCorpus.plot_distribution('date')
Note that it's a bit smoother, and the per-slice counts are much higher overall (max of ~700/slice, versus ~200/slice before), because each Paper is now counted in every window that contains its publication year.
Setting cumulative=True means that all of the Papers in the slice at time 0 will be included in the slice at time 1, and so on.
In [12]:
MyCorpus.slice('date', cumulative=True)
In [13]:
fig = MyCorpus.plot_distribution('date')
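The effect of cumulative slicing can be sketched with a plain list of years. The `cumulative_slices` helper and `toy_years` are hypothetical and only illustrate the idea that each slice contains everything from the slices before it; they are not taken from Tethne.

```python
def cumulative_slices(years):
    """Map each year to all items dated on or before that year."""
    slices = {}
    for year in range(min(years), max(years) + 1):
        slices[year] = [y for y in years if y <= year]
    return slices


toy_years = [2000, 2001, 2001, 2003]
cumulative = cumulative_slices(toy_years)

# Slice sizes can only grow (or stay flat) as time advances.
sizes = [len(cumulative[year]) for year in sorted(cumulative)]
print(sizes)  # [1, 3, 3, 4]
```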
Before we get to the stage of networking, we can visualize how certain features are distributed in our Corpus. We'll develop the concept of features further along in these tutorials.
In Tethne, a feature is anything (a categorical variable, scalar, etc.) that can be distributed over Papers. A cited reference, for example, can be a feature. So can a word. We can think of features in terms of the presence or absence of something (e.g. a cited reference), or in terms of a quantity (e.g. the number of times the word 'organism' appears in a Paper).
A featureset is, as the name suggests, a collection of similar features. For example, we can think of all of the cited references in our Corpus as a featureset. A vocabulary of words can also be a featureset.
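As a rough mental model, a featureset can be pictured as a mapping from Papers to the features they contain. The sketch below uses a made-up word-count featureset (`toy_wordcounts`, with invented paper identifiers and counts) to show the two ways of reading a feature described above: presence or absence, and quantity. The data layout is an assumption for illustration, not Tethne's internal structure.

```python
# Hypothetical featureset: paper identifier -> {word: number of occurrences}.
toy_wordcounts = {
    'paper_a': {'organism': 3, 'gene': 1},
    'paper_b': {'gene': 2},
    'paper_c': {'organism': 1},
}


def presence(featureset, feature):
    """Return the papers in which the feature occurs at all."""
    return sorted(p for p, feats in featureset.items() if feature in feats)


def quantity(featureset, feature):
    """Return the total number of occurrences of the feature."""
    return sum(feats.get(feature, 0) for feats in featureset.values())


papers_with_organism = presence(toy_wordcounts, 'organism')
total_organism = quantity(toy_wordcounts, 'organism')

print(papers_with_organism)  # ['paper_a', 'paper_c']
print(total_organism)        # 4
```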
Each Corpus has an attribute called features that holds featuresets. It's a dictionary, so you can see what featuresets your Corpus contains by calling Corpus.features.keys():
In [14]:
MyCorpus.features.keys()
Out[14]:
So far our Corpus contains only a featureset called citations. Each featureset has an index of features that it contains.
In [15]:
print 'There are {0} cited references in this Corpus.'.format(len(MyCorpus.features['citations']['index']))
MyCorpus.features['citations']['index'].items()[0:10] # Only viewing the first 10 items, since there are so many.
Out[15]:
We can view the distribution of a feature across the slices in our Corpus using plot_distribution. This is a bit more complicated than before. We first create a dictionary that describes the feature we're interested in, and then we pass it and the mode='features' keyword argument to plot_distribution.
In [17]:
fkwargs = {
'featureset': 'citations',
'feature': 'FALCONER DS 1996 INTRO QUANTITATIVE G', # An infamous textbook on quant-gen!
'mode': 'counts', # The frequency of the citation in each slice.
'normed': True, # Normalized by the number of papers in each slice.
}
In [18]:
MyCorpus.slice('date') # We'll go back to 1-year time-period slicing for now.
fig = MyCorpus.plot_distribution('date', mode='features', fkwargs=fkwargs)
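To see what a normalized per-slice count amounts to, here is a small sketch that follows the comments in the fkwargs dictionary above: count the Papers in each year that cite the feature, then divide by the number of Papers in that year. The `normed_counts` helper and the `toy_papers` records are hypothetical, and Tethne's own calculation may differ in its details.

```python
TARGET = 'FALCONER DS 1996 INTRO QUANTITATIVE G'

# Hypothetical stand-ins for Papers: a year plus a list of cited references.
toy_papers = [
    {'date': 2010, 'citations': [TARGET]},
    {'date': 2010, 'citations': []},
    {'date': 2011, 'citations': [TARGET]},
    {'date': 2011, 'citations': []},
    {'date': 2011, 'citations': []},
    {'date': 2011, 'citations': []},
]


def normed_counts(papers, feature):
    """Per-year count of papers citing `feature`, divided by papers per year."""
    totals, hits = {}, {}
    for paper in papers:
        year = paper['date']
        totals[year] = totals.get(year, 0) + 1
        if feature in paper['citations']:
            hits[year] = hits.get(year, 0) + 1
    # float() keeps the division exact under Python 2 as well as Python 3.
    return dict((year, hits.get(year, 0) / float(totals[year])) for year in totals)


normed = normed_counts(toy_papers, TARGET)
print(normed[2010])  # 0.5
print(normed[2011])  # 0.25
```

Normalizing this way makes years comparable even when the number of Papers per year varies widely.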
We can also slice our Corpus using other elements of our data, such as journal names. To slice by journal name, try:
In [19]:
MyCorpus.slice('jtitle') # jtitle stands for 'journal title'
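Slicing on a categorical field such as a journal title is essentially a grouping operation. The following sketch uses `collections.Counter` on made-up records (`toy_papers`, with placeholder journal names) to show the kind of per-journal tally that such a slice axis represents; it does not use Tethne.

```python
from collections import Counter

# Hypothetical stand-ins for Papers; the journal names are placeholders.
toy_papers = [
    {'jtitle': 'JOURNAL A'},
    {'jtitle': 'JOURNAL B'},
    {'jtitle': 'JOURNAL A'},
    {'jtitle': 'JOURNAL C'},
    {'jtitle': 'JOURNAL A'},
]

# Tally the number of papers under each journal title.
journal_counts = Counter(paper['jtitle'] for paper in toy_papers)

top_journal, top_count = journal_counts.most_common(1)[0]
print(top_journal)  # JOURNAL A
print(top_count)    # 3
```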
We can use plot_distribution to look at the distribution of Papers across journals. In this example I use Matplotlib's pyplot module to control the layout of the figure (it's pretty huge).
In [16]:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(50,145))
fig = MyCorpus.plot_distribution(y_axis='jtitle', fig=fig, aspect=0.6, interpolation='none')
And we can view the distribution of a feature, like citations of Falconer's Quantitative Genetics textbook, across journals in a similar way:
In [17]:
fig = plt.figure(figsize=(50,145))
fig = MyCorpus.plot_distribution(y_axis='jtitle', fig=fig, aspect=0.6, interpolation='none', mode='features', fkwargs=fkwargs)
Of course, these figures are enormous and not very informative as is. The idea is that you can use this as a starting point for your analysis, and ultimately generate your own plots.